We discuss the application of a class of machine learning algorithms known as
decision trees to the process of galactic classification. In particular, we explore the
application of oblique decision trees induced with different impurity measures to the
problem of classifying galactic morphology data provided by Storrie-Lombardi et al.
(1992). Our results are compared to those obtained by a neural network classifier created
by Storrie-Lombardi et al., and we show that the two methodologies are comparable.
We conclude with a demonstration that the original data can be easily classified into
less well-defined categories.
Key words: catalogues - galaxies: fundamental parameters.

1 INTRODUCTION

Decision tree algorithms have proven themselves useful as
automated classifiers in a number of astronomical domains.
An oblique decision tree algorithm created by Murthy, Kasif,
& Salzberg (1994) has demonstrated that decision trees can
be generated to distinguish between stars and galaxies or to
identify cosmic rays in Space Telescope images. Typically,
the trees produced by this algorithm possess accuracies up
to and exceeding 95%, and can now be used to classify
additional data (Salzberg et al. 1994). In this way, decision trees
free researchers from the tedious task of object classification.

The next logical step for a decision tree classification
algorithm is, of course, the morphological classification of
galaxies. As Storrie-Lombardi et al. point out (1992),
this has been attempted by a number of researchers with
limited success; today "morphological classification into
ellipticals, lenticulars, spirals and irregulars remains a process
dependent on the eyes of a handful of dedicated individuals."

In an attempt to rectify this situation, Storrie-Lombardi
et al. (hereafter referred to as SLSS) have applied a
computing technique known as artificial neural networks (also
known as neural nets or ANNs) to the problem of galaxy
classification. In particular, SLSS used their neural net to
classify galaxies taken from the ESO-LV catalog (Lauberts
& Valentijn 1989). In this paper, we compare results from
the SLSS ANN with those from decision trees.



2 NEURAL NETS AND DECISION TREES 

Here we will describe some of the fundamental differences
between the neural network algorithm implemented by SLSS
and decision trees. Our discussion will focus only on the
dominant aspects of these algorithms. For more information,
the interested reader is referred to the SLSS paper on
neural network classification (Storrie-Lombardi et al. 1992).
For an introduction to decision trees, Quinlan's original
paper (1986) on the topic is an excellent starting point. An
in-depth look at the original oblique decision tree algorithm
used to generate the trees for our experiments can be found
in a paper by Murthy et al. (1994).
Table 1. SLSS results with an ANN. Overall accuracy = 64.1%.

Class         E     S0  Sa+Sb  Sc+Sd    Irr
E           203     77     25      1      5
S0          109    229    240      7      2
Sa+Sb        12     85   1281    218     15
Sc+Sd         1      4    304    415     36
Irr           0      0     53     69    126

Table 2. Lauberts & Valentijn's results with their own automated
classifier. Overall accuracy = 56.3%.

Class         E     S0  Sa+Sb  Sc+Sd    Irr
E           197     87     17      5      5
S0          184    218    155     28      2
Sa+Sb       106     12    791    664     38
Sc+Sd        22     11     24    631     72
Irr          22      9     31     42    144

2.1 Neural Networks 

An artificial neural network consists of nodes, roughly anal- 
ogous to human neurons, arranged in a series of layers. All 
the nodes of each layer can be fully connected to the nodes 
in the next. Weights between nodes indicate how a series 
of inputs are to be transformed into output; for example, 
how attributes describing an object (input) determine the 
object's classification (output). "Hidden nodes" occur be- 
tween the input layer of nodes and the output layer. Node 
weights are updated via a hill-climbing error-minimization 
procedure. In other words, as the neural net learns, weights 
are updated such that the overall accuracy of the neural net 
improves. 

The error-minimization procedure implemented by 
SLSS is known as back-propagation. Back-propagation mod- 
ifies weights from the output nodes backwards to the nodes 
accepting input. As mentioned, weights are only altered 
when back-propagation improves the neural net's overall ac- 
curacy. Of course, to determine overall accuracy, some kind 
of pre-classified data is required (discussed below). 

The ANN implemented by SLSS contains 13 input 
nodes, 13 hidden nodes, and 5 output nodes. The 5 out- 
put nodes correspond to the five target classifications the 
neural net has been trained to produce. A classification is 
based on the output node with the largest value. Following 
a crude Hubble sequence, the five classes chosen by SLSS 
are: E, S0, Sa+Sb, Sc+Sd, and Irr.
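As a rough illustration of how a network of this shape turns 13 attributes into one of five classes, the sketch below runs a single forward pass through a 13-13-5 network and picks the output node with the largest value. The sigmoid activation, the random weights, and every function name are assumptions made for illustration; none of them is taken from the SLSS implementation, and no training (back-propagation) is performed.

```python
import math
import random

CLASSES = ["E", "S0", "Sa+Sb", "Sc+Sd", "Irr"]

def sigmoid(z):
    """Logistic activation (an assumed choice, not confirmed by the source)."""
    return 1.0 / (1.0 + math.exp(-z))

def layer(inputs, weights, biases):
    """Fully connected layer: one weighted sum plus bias per output node."""
    return [sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

def random_layer(n_in, n_out, rng):
    """Untrained placeholder weights; a real net would learn these."""
    weights = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]
    biases = [rng.uniform(-1, 1) for _ in range(n_out)]
    return weights, biases

def classify(attributes, hidden, output):
    """Forward pass; the class is the output node with the largest value."""
    hidden_values = layer(attributes, *hidden)
    scores = layer(hidden_values, *output)
    best = max(range(len(scores)), key=lambda i: scores[i])
    return CLASSES[best], scores

rng = random.Random(0)
hidden = random_layer(13, 13, rng)   # 13 inputs -> 13 hidden nodes
output = random_layer(13, 5, rng)    # 13 hidden nodes -> 5 output nodes
label, scores = classify([0.5] * 13, hidden, output)
print(label, [round(s, 3) for s in scores])
```

With untrained weights the predicted label is arbitrary; the point is only the data flow from 13 inputs to the largest of 5 outputs.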

Because the ANN developed by SLSS is a supervised 
neural network, the ANN must first be trained on some pre- 
classified set of data. Subsets of this data were reserved and 
used to test the accuracy of the neural net. Because the 
ANN must first be trained, its accuracy is wholly dependent 
on the accuracy of the training data. Good training data is 
essential to the production of a good classifier. 



The results obtained by the SLSS neural net are given 
in table 1. 



2.2 Decision Trees 

A decision tree can be thought of as the outline of a deci- 
sion process. As with any tree-like data structure, a decision 
tree consists of both internal and leaf nodes. Internal nodes 
correspond to choices to be made from the set of training 
data; leaf nodes correspond to conclusions. A path from the 
root node to a specific leaf constitutes a decision. 

Throughout the rest of this paper, we will be concerned 
only with binary decision trees. Binary decision trees can 
make only yes/no decisions at each internal node. 

To determine the classification of a specific object, the
attributes describing the object are passed through the
decision tree starting at the root. Each internal node might
contain a test of the form $X_i > k$, where $X_i$ is the $i$th
attribute of an example $X$ and $k$ is a test value. This
defines a hyperplane perpendicular to the axis of attribute $i$.

A decision is based on whether a specific attribute $X_i$
for a given example is greater or less than a value $k$ stored
in the tree. If the example's attribute is greater, then the
"yes" branch is followed; otherwise the opposite "no" branch
is taken. This process continues recursively until a leaf
node (a conclusion) is reached. Tests of this form - tests
that classify objects by dividing a data space - are known
as splits. Splits that segment a space only along a single
dimension (with a single attribute) such as the above are
commonly referred to as axis-parallel splits.

Another test that could be performed at each node is
an oblique split:

$$\sum_{i=1}^{d} a_i X_i + a_{d+1} > 0$$

where each example $X$ has $d$ attributes $X_1, \ldots, X_d$ and
$a_1, \ldots, a_{d+1}$ are real-valued coefficients. Here, decisions are
made by determining whether an object lies above or below
the hyperplane in the $d$-dimensional attribute space defined
by the coefficients $a_i$. As with axis-parallel tests, whether or
not an example lies above the hyperplane indicates whether the
"yes" or "no" branch will be followed. Splits of this form are
referred to as oblique.
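The two kinds of test and the root-to-leaf walk can be sketched in a few lines. In this sketch, attribute values are written as a list `x` and hyperplane coefficients as `coeffs` (d weights followed by a constant term); the `Node` class, the helper names, and the tiny example tree with leaf labels "A", "B", "C" are all hypothetical and are not part of OC1.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

@dataclass
class Node:
    label: Optional[str] = None               # set on leaf nodes only
    coeffs: Optional[Sequence[float]] = None  # d weights + constant, internal nodes
    yes: Optional["Node"] = None
    no: Optional["Node"] = None

def oblique_test(x, coeffs):
    """True ("yes" branch) if sum_i a_i * x_i + a_{d+1} > 0."""
    d = len(x)
    if len(coeffs) != d + 1:
        raise ValueError("expected d weights plus one constant term")
    return sum(a * v for a, v in zip(coeffs, x)) + coeffs[d] > 0

def axis_parallel_coeffs(d, i, k):
    """Express the axis-parallel test x_i > k as an oblique test."""
    coeffs = [0.0] * d + [-k]
    coeffs[i] = 1.0
    return coeffs

def classify(node, x):
    """Follow yes/no branches from the root until a leaf is reached."""
    while node.label is None:
        node = node.yes if oblique_test(x, node.coeffs) else node.no
    return node.label

# Hypothetical two-attribute tree: root tests x_0 > 1.5 (axis-parallel),
# its "no" child tests x_0 - x_1 + 0.5 > 0 (oblique).
tree = Node(
    coeffs=axis_parallel_coeffs(2, 0, 1.5),
    yes=Node(label="A"),
    no=Node(coeffs=[1.0, -1.0, 0.5], yes=Node(label="B"), no=Node(label="C")),
)

print(classify(tree, [2.0, 0.0]))  # A
print(classify(tree, [1.0, 1.0]))  # B
print(classify(tree, [0.0, 1.0]))  # C
```

Treating an axis-parallel split as an oblique split with a single non-zero weight lets one traversal routine handle both kinds of node.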

The particular trees we have grown for our experiments
contain almost exclusively oblique splits and will be called
oblique decision trees. Oblique decision trees are known to
require, in general, far fewer nodes to accurately describe
data. Smaller trees are desirable by Occam's razor:
intuitively, given two trees, each equally accurate on a set
of training data, the decision tree with fewer nodes is
expected, in general, to make more accurate decisions on
new, previously unseen data. The algorithm we use to
construct our decision trees is an extension
of Murthy et al.'s Oblique Classifier 1 (OC1) algorithm.



3 TRAINING OBLIQUE TREES WITH OC1 

Like the SLSS ANN, OC1 learns how to classify a data set 
by first training on a pre-classified subset of the data. The 
decision tree grown from this training set can then be used to 
classify the remainder of the data as well as new examples. 

As described above, OC1 uses both oblique and axis- 
parallel hyperplanes to partition sets of data. The internal 
nodes of a tree generated by OC1 therefore constitute ei- 
ther axis-parallel or oblique splits. OC1 searches through 
the data to find the best split for each node. The quality 
of a split is determined by a measure known as impurity. 
Impurity is a heuristic measure of how poorly a certain split 
will separate data. The goal of OC1 is to find splits which 
minimize the overall impurity of a decision tree. 

While most of the searching that OC1 performs is local
deterministic hill-climbing (although, as we are using
impurity measures, it would be more accurate to say "hill
descending"), some randomization has been introduced to
determine placement of the initial hyperplane and to escape
from local minima. This stochastic component of the
algorithm is necessary because of the enormous size of the search
space: given $n$ objects with $d$ dimensions (attributes), the
number of distinct oblique hyperplanes that can separate
the $n$ objects is at most $2^d \cdot \binom{n}{d}$. This is a much
greater value than the $n \cdot d$ distinct axis-parallel splits that
can separate $n$ objects. By performing multiple local searches
in this manner, OC1 can come very close to optimal solutions
without the overhead of an exhaustive search. In fact, as Heath
has shown, the problem of finding an optimal oblique split is
NP-complete (Heath 1995). In other words, it is doubtful
that any algorithm could find an optimal oblique split in an
amount of time that is a polynomial function of $n$ and $d$:
such an algorithm would more than likely require an
exponential amount of processing time. OC1 takes the best split
it can find for an internal node before moving recursively
to the next. Thus, OC1 performs a greedy search for each
split. OC1 does not use any form of "lookahead" to
determine whether potentially bad splits might result in good
trees. Murthy and Salzberg (1994) have shown that such
a mechanism provides only marginal, if any, improvement.
Like OC1, most decision tree algorithms restrict their search
to the space of data, rather than searching unnecessarily
through the space of trees.
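To give a feel for the gap between the two search spaces, the following sketch evaluates both counts for a training set of the size used later in this paper (n = 1700 objects, d = 13 attributes). It assumes the oblique count is the bound 2^d multiplied by "n choose d", as given by Murthy et al. (1994); the function names are illustrative only.

```python
import math

def axis_parallel_splits(n, d):
    """Number of distinct axis-parallel splits: n * d."""
    return n * d

def oblique_splits_upper_bound(n, d):
    """Upper bound on distinct oblique splits: 2^d * C(n, d)."""
    return 2 ** d * math.comb(n, d)

n, d = 1700, 13
axis = axis_parallel_splits(n, d)
oblique = oblique_splits_upper_bound(n, d)

print(axis)                        # 22100
print(f"{oblique:.3e}")            # very large; far beyond exhaustive search
print(oblique // axis)             # ratio of the two counts
```

Python integers are arbitrary precision, so the bound is computed exactly; only the printed form is rounded.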

Heath (1995) points out that the stochastic element of
algorithms such as OC1 can be advantageous. By generating
multiple classifiers from a set of data, classification of new
data can be determined by popular vote. That is, the most
common classification of an object among multiple classifiers
determines the object's overall class, thereby reducing the
chance of classification error. Thus, OC1 could be used to
generate multiple decision trees, a decision forest, from a set
of training data. The classification of a new object would be
determined by the most common classification to occur in
the decision forest.
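The voting step can be sketched as follows. Each "tree" here is just a hypothetical callable that maps an example to a class label; in practice these would be trees grown with different random seeds. How ties are broken is not specified in the text, so the tie rule below (first label encountered wins) is an assumption.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the most common label; ties go to the label seen first."""
    return Counter(predictions).most_common(1)[0][0]

def forest_classify(trees, example):
    """Classify an example by popular vote among all trees in the forest."""
    return majority_vote([tree(example) for tree in trees])

# Three stand-in classifiers (hypothetical thresholds on one attribute).
trees = [
    lambda x: "E" if x[0] > 0.5 else "S0",
    lambda x: "E" if x[0] > 0.4 else "S0",
    lambda x: "E" if x[0] > 0.9 else "S0",
]

print(forest_classify(trees, [0.6]))  # E   (two of three trees vote E)
print(forest_classify(trees, [0.3]))  # S0  (all three trees vote S0)
```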



3.1 Impurity Measure 

The impurity measure we have chosen to use with OC1 is
known as the "twoing" criterion and can be defined as
follows (Breiman et al. 1984, Salzberg et al. 1994):

$$(p_L \cdot p_R) \left( \sum_j \left| p(j|L) - p(j|R) \right| \right)^2$$

where $p_L$ and $p_R$ are, respectively, the proportion of
examples on the left and right side of a split, and both $p(j|L)$ and
$p(j|R)$ represent the proportions of class $j$ on the left and
right sides of the split. This criterion assigns higher values
to hyperplanes that come close to splitting the data in half
and to hyperplanes that split cleanly between classes. When
examples from the same class are split apart, the values
returned by the twoing criterion indicate increased impurity.
As mentioned above, OC1 strives to minimize this measure
of impurity for each split.
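A small worked computation may make the twoing value concrete. The sketch below evaluates the product of the two side proportions and the squared sum of per-class differences for a split given as two lists of class labels. It returns the raw value, for which higher means a better split; how that value is turned into an impurity to be minimized (for example by negation or by taking a reciprocal) is left out here and would be an assumption about OC1's internals. The function name is illustrative.

```python
from collections import Counter

def twoing_value(left_labels, right_labels):
    """p_L * p_R * (sum_j |p(j|L) - p(j|R)|)^2 for a two-way split."""
    n_left, n_right = len(left_labels), len(right_labels)
    if n_left == 0 or n_right == 0:
        return 0.0  # a split that sends everything one way separates nothing
    n = n_left + n_right
    p_left, p_right = n_left / n, n_right / n
    left_counts, right_counts = Counter(left_labels), Counter(right_labels)
    classes = set(left_counts) | set(right_counts)
    diff = sum(abs(left_counts[c] / n_left - right_counts[c] / n_right)
               for c in classes)
    return p_left * p_right * diff ** 2

# A clean, balanced split scores highest.
print(twoing_value(["E", "E"], ["Irr", "Irr"]))   # 1.0
# A split that leaves both sides with the same class mix scores zero.
print(twoing_value(["E", "Irr"], ["E", "Irr"]))   # 0.0
```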



4 EXPERIMENTS WITH DECISION TREES 

Here we repeat the experiments performed by SLSS with
the decision tree generating algorithm, OC1. We also include
some experiments of our own. As in the SLSS experiments,
data is taken from the ESO-LV catalog (Lauberts &
Valentijn 1989). The 13 parameters (attributes) used to describe
each object in their experiments are given below (taken from
SLSS 1992):

• $\langle B - R \rangle$: average color in region with B surface
  brightness 20.5 to 26.

• $N^B_{oct}$: exponent of the fit of a generalized de Vaucouleurs
  law to [the galaxy profile] B octants ($N = 0.25$ corresponds
  to a perfect elliptical galaxy and $N = 1$ to a pure
  exponential disk).

• $\log(D_{80}/D_e)$, where $D_{80}$ and $D_e$ are the major diameters
  of the ellipses at 80 per cent and half total B light
  respectively.

• Arctangent of the absolute value of the ratio of the
  mean tangential and radial gradients, which is an
  indicator of the degree of asymmetry of the galaxy image.

• B central surface brightness from the fit of a generalized
  de Vaucouleurs law to B octants.

• $\log(b/a)$, where $b/a$ is the galaxy axial ratio.

• Error in ellipse fit to B isophotes at B surface brightness 23.

• Gradient of the B surface brightness profile at $D_e$.

• $\log(D_{26}/D_e)$, where $D_{26}$ is the major diameter of the
  ellipse at 26 B mag arcsec$^{-2}$.

• $N^R_{oct}$: exponent of the fit of a generalized de Vaucouleurs
  law to R octants.

• Average B surface brightness within 10 arcsec diameter
  circular aperture.

• B surface brightness at half total B light.

• R surface brightness at half total R light.

Only those galaxies with ESO visual diameter > 1 arcmin
and at high Galactic latitude ($|b| > 30^\circ$) were
considered. All galaxies have been morphologically classified
via visual examination. According to SLSS, these attributes
were chosen because they are distance independent and
because they are very similar to those used by Lauberts &
Valentijn to perform their own automated classification
(results of which are presented in table 2).

The final data set contains 5217 galaxies. SLSS randomly
sorted this data into two sets of 1700 and 3517 objects
to be used, respectively, for training and testing. We
trained our decision trees using both this method and a
five-fold cross-validation experiment. The five major classes,
determined by Lauberts & Valentijn, were binned based on
the following criteria: E ($-5.0 < T < -2.5$, 466 galaxies),
S0 ($-2.5 < T < 0.5$, 851 galaxies), Sa+Sb ($0.5 < T < 4.5$,
2403 galaxies), Sc+Sd ($4.5 < T < 8.5$, 1132 galaxies), and
Irr ($8.5 < T < 10.0$, 365 galaxies), where $T$ is an object's
type.
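The binning can be written down directly. The ranges are stated with strict inequalities, so the treatment of values that fall exactly on a boundary is not specified; the sketch below assumes half-open intervals (lower bound included, upper bound excluded) with T = 10.0 assigned to Irr. That boundary rule and the names `BINS`, `CLASS_COUNTS` and `bin_t_type` are assumptions for illustration.

```python
# (class name, lower T bound, upper T bound), as listed in the text
BINS = [
    ("E", -5.0, -2.5),
    ("S0", -2.5, 0.5),
    ("Sa+Sb", 0.5, 4.5),
    ("Sc+Sd", 4.5, 8.5),
    ("Irr", 8.5, 10.0),
]

# Class sizes quoted in the text
CLASS_COUNTS = {"E": 466, "S0": 851, "Sa+Sb": 2403, "Sc+Sd": 1132, "Irr": 365}

def bin_t_type(t):
    """Map a numerical type T to one of the five classes (assumed [lo, hi) bins)."""
    for name, lo, hi in BINS:
        if lo <= t < hi:
            return name
    if t == 10.0:  # assumed: the top edge belongs to the last bin
        return "Irr"
    raise ValueError(f"T = {t} is outside the range -5.0 to 10.0")

print(bin_t_type(-3.0))             # E
print(bin_t_type(3.0))              # Sa+Sb
print(sum(CLASS_COUNTS.values()))   # 5217, matching the size of the data set
```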

Tables 1 and 2 compare the performance of the SLSS 
ANN to the Lauberts & Valentijn classifier. Rows in these 
tables reveal visual type distributions; columns depict au- 
tomated type distribution. Left to right diagonals are the 
values for which both human and automated classifiers per- 
fectly agree. Overall accuracy for each of the classifiers is 
given in the caption above each table. Notice that the ANN 
produces superior performance. 
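The overall accuracies quoted in the captions follow from the tables themselves: the sum of the diagonal divided by the total number of test galaxies. The sketch below reproduces both figures from the counts in tables 1 and 2. The two empty cells in the Irr row of table 1 are taken to be zero, an assumption that is consistent with the Irr row totalling 248 galaxies in both tables; the variable and function names are illustrative.

```python
# Rows: visual class; columns: automated class (E, S0, Sa+Sb, Sc+Sd, Irr)
TABLE_1 = [  # SLSS ANN
    [203, 77, 25, 1, 5],
    [109, 229, 240, 7, 2],
    [12, 85, 1281, 218, 15],
    [1, 4, 304, 415, 36],
    [0, 0, 53, 69, 126],   # first two cells assumed zero
]
TABLE_2 = [  # Lauberts & Valentijn automated classifier
    [197, 87, 17, 5, 5],
    [184, 218, 155, 28, 2],
    [106, 12, 791, 664, 38],
    [22, 11, 24, 631, 72],
    [22, 9, 31, 42, 144],
]

def overall_accuracy(matrix):
    """Fraction of objects on the diagonal of a confusion matrix."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

for name, matrix in [("SLSS ANN", TABLE_1), ("Lauberts & Valentijn", TABLE_2)]:
    print(f"{name}: {100 * overall_accuracy(matrix):.1f}%")
# SLSS ANN: 64.1%
# Lauberts & Valentijn: 56.3%
```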

Table 3 outlines the overall performance of OC1 using a
five-fold cross-validation experiment. Cross-validation
studies of this sort are known to produce reliable estimates of
accuracy while avoiding "optimistic" bias (Salzberg et al.
1994). The "leaves" column denotes the number of possible
decisions contained within each tree. To perform five-fold
cross-validation, all 5217 galaxies were split into five sets
of approximately equal size. 4/5 of this data was reserved for
training and the remaining 1/5 for testing. The process was
repeated another four times, thus allowing each set to be
used once as a test set.
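The fold construction described above can be sketched as follows. The shuffling seed, the round-robin assignment of objects to folds, and the function name `k_fold_indices` are assumptions for illustration; the paper does not say how its folds were drawn. Since 5217 is not divisible by five, the folds here differ in size by at most one object.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Yield (train, test) index lists; each fold is the test set exactly once."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # round-robin assignment
    for i in range(k):
        test = folds[i]
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, test

splits = list(k_fold_indices(5217, 5))
for fold_number, (train, test) in enumerate(splits, start=1):
    print(fold_number, len(train), len(test))
```

Averaging the per-fold test accuracies then gives the kind of summary figure reported in the "Average" row of table 3.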

The left half of table 3 demonstrates OC1's performance
with a modest amount of random search; the right half
illustrates the performance of a more exhaustive search. From



Table 3. Five-fold cross-validation on all 5217 objects. Left:
Modest random search (5 random perturbations, 20 restarts);
Right: Extensive random search (100 random perturbations; 100
random restarts).

           Modest Search         Extensive Search
Fold       Accuracy   Leaves     Accuracy   Leaves
1            64.5       11         64.8       15
2            64.4        7         65.3        9
3            62.3        8         62.9        8
4            64.1        9         65.3        8
5            63.8        9         64.7       10
Average      63.8      8.8         64.6     10.0



Table 4. Results obtained with AP-OC1. Overall accuracy =
63%. Number of leaves = 43.

Class         E     S0  Sa+Sb  Sc+Sd    Irr
E           207     75     23      2      4
S0          101    227    250      7      2
Sa+Sb         5     78   1336    175     17
Sc+Sd         0      3    390    302     65
Irr           0      0     57     51    140



this table it is clear that additional search results in only a
marginal accuracy improvement (on average, less than 1%).
Interestingly, the improvement in accuracy seems to require
an increase in the number of leaf nodes (decisions) contained
within each tree. Because additional search requires more
processing time, we use only the default, modest degree of
search in the remaining experiments.

The results in table 4 were obtained by running OC1 
exclusively in axis-parallel mode (AP-OC1), which requires 
no random search. Like SLSS, only 1700 randomly chosen 
objects were used for training. Cross-validation was not used 
so that our experiments would remain consistent with those 
performed by SLSS. Notice that even with this very simple 
method of constructing decision trees, we are but one per- 
cent away from the overall accuracy obtained by the SLSS 
neural net (table 1). 

Table 5 displays results of the most accurate tree taken
from five runs of OC1 with different random seeds. Again
only 1700 randomly chosen objects were used for training.
Notice that the incorporation of oblique splits has resulted
in a tree that is roughly half the size of the axis-parallel
tree (23 leaves versus 43).



Table 5. Results obtained with the most accurate tree produced
by OC1. Overall accuracy = 63%. Number of leaves = 23. 5 random
perturbations; 20 random restarts.

Class         E     S0  Sa+Sb  Sc+Sd    Irr
E           188    102     17      1      3
S0           84    259    236      7      1
Sa+Sb         3    100   1279    220      9
Sc+Sd         0      1    335    375     19
Irr           0      0     19     80    119
Table 6. Accuracy and tree size (in number of leaves) of
non-neighbor decision trees.



[Bar chart; horizontal axis: galaxy classification]

Figure 1. Distribution of classification for the test sample (3517
galaxies). White bars indicate visual classification, light gray
indicate Lauberts & Valentijn's automated classifier, dark gray is
the SLSS ANN; black represents OC1.



Still, this tree is large for an oblique decision tree. Most 
trees produced by OC1 are much smaller. In one case, for 
example, by using a different information measure known as 
the gini index, an oblique tree composed of 6 leaves with 
62% accuracy was constructed. The gini index, which is a 
measure of the probability of misclassification over a set of 
instances, has been modified for use as an impurity measure 
in the OC1 package (Murthy et al. 1994, Breiman et al. 
1984). Different methods for evaluating the quality of a split, 
including the gini index and the twoing criterion, extend the 
power and flexibility of OC1. 
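For readers unfamiliar with the gini index, the sketch below computes the standard form from Breiman et al. (1984): one minus the sum of squared class proportions, with a split scored by the size-weighted average over its two sides. The text notes that OC1 uses a modified version; that modification is not described here, so this is the textbook definition rather than OC1's exact variant, and the function names are illustrative.

```python
from collections import Counter

def gini(labels):
    """Standard gini index of a set of class labels: 1 - sum_j p_j^2."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def gini_of_split(left_labels, right_labels):
    """Size-weighted average gini of the two sides of a split (lower is purer)."""
    n = len(left_labels) + len(right_labels)
    return (len(left_labels) * gini(left_labels)
            + len(right_labels) * gini(right_labels)) / n

print(gini(["E", "E"]))                              # 0.0  (pure set)
print(gini(["E", "Irr"]))                            # 0.5  (even two-class mix)
print(gini_of_split(["E", "E"], ["Irr", "Irr"]))     # 0.0  (clean split)
print(gini_of_split(["E", "Irr"], ["E", "Irr"]))     # 0.5  (uninformative split)
```

Unlike the twoing value, this quantity is already an impurity: lower values indicate better splits.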

The five randomized OC1 runs produced trees with an 
average of 17 leaves and an average accuracy of 61%. Appar- 
ently the reduction in training examples has had an adverse 
effect on the quality of our decision trees. Furthermore, while
normally we would expect the smaller trees to be more accu- 
rate classifiers, the fact that one of the larger trees achieves 
the highest accuracy indicates that the data is complex and 
difficult to generalize. Our result from table 3, that addi- 
tional accuracy seems correlated to additional leaves, also 
supports this conclusion. 

Figure 1 compares classification distributions from each
of the classifiers to the original visual distribution. Notice
how closely related the neural net and decision tree
distributions are, even though none of the automated
classifiers parallels the visual classifications
with considerable accuracy. While Lauberts & Valentijn's
ESO classifier seems to reverse the distribution of Sa+Sb
and Sc+Sd galaxies found in the visual classification, both
the ANN and the decision tree reproduce the visual pattern.



Tree              Accuracy (%)   Leaves
1. E / Sa+Sb          96.4          3
2. E / Sc+Sd          97.3          2
3. E / Irr            95.7          2
4. S0 / Sc+Sd         91.4          2
5. S0 / Irr           95.7          2
6. Sa+Sb / Irr        92.8          3



5 DISCUSSION



We have shown that decision trees can be used to determine
the morphological classification of galaxies with reasonable
success. Furthermore, our comparison of a neural net
classifier with a decision tree algorithm shows that the two
produce similar results. Errors made by both classifiers can
be attributed to
one or more of the following problems: 1) there are errors in 
the visual classification of the 5217 galaxies which comprise 
both the training and test sets, or 2) none of the attributes 
used to describe the data provides sufficient information for 
accurate classification. SLSS account for this by noting that 
the classifications are based on plate material rather than 
CCD frames, and that the parameters used to describe the 
galaxies were chosen somewhat arbitrarily. 

As can be seen in both the ANN results and those
obtained by our decision trees (tables 1, 4, and 5), while
non-neighbor classes can, potentially, be easily separated,
neighbor classes cannot. In other words, while trees grown to
discriminate E-type galaxies from Sa+Sb-types might typically
be very accurate, trees that distinguish between neighboring
types such as E and S0 would have very poor accuracies. In
fact, as SLSS noticed, scoring accuracy in terms of nearest
neighbor classifications results in roughly a 90% accuracy.

Table 6 demonstrates that multiple decision trees can,
in fact, be generated to easily distinguish between
different regions along the continuum of classifications. The
decision trees used to produce these results were trained on
the small 1700 object set used above. Clearly, extremely
accurate and simple trees can be induced from this simplified
data. With these trees, galaxies can now be confidently
classified to larger, overlapping regions. For example, by using
a majority vote among all six trees, a galaxy might be
classified either to the E-S0 region or to the S0-Sa+Sb-Sc+Sd
region. In one experiment, the trees in table 6 were
manually assembled to produce an E-Sa+Sb-Irr classifier with a
90.7% accuracy.

In one last attempt to overcome the five-classification
"fuzziness" of the data, we tried growing two new trees: one
to separate E and Irr-type galaxies from spirals, and another
to identify the S0, Sa+Sb, and Sc+Sd-types in the spiral
subset. The result was a modest increase in accuracy to 66%.
While this result could most likely have been achieved by
increasing the number of random searches performed by OC1,
by initially filtering out the spirals, we were able to direct
the search in a direction we wanted to explore. By reducing
the search space in this manner, we also reduced processing
time.

Finally, even though neither the decision tree nor the 
ANN produced remarkable results, global classifications by 
the two differ for less than 3% of the test set. (Attempts 
have not yet been made to determine accuracy by compar- 
ing classifications example by example.) Furthermore, mis- 
classifications made by the oblique decision tree in table 5 
match roughly 83% of the misclassifications made by the 
SLSS ANN. The similar results obtained by these two very 
different classifiers do, perhaps, point to the existence of
error in the original data. At the very least, our results con- 
firm the discovery made by the SLSS ANN that the dis- 
tinction between neighboring classes appears to be poorly 
defined. The two classification algorithms may ultimately 
be discovering a more accurate way by which to classify the 
original Lauberts & Valentijn data. 



SOFTWARE 

The OC1 software discussed in this paper is available 
over the Internet via anonymous ftp. The package con- 
tains full source code and extensive documentation. For in- 
structions on how to retrieve the package, the authors re- 
quest inquiries to be made to either salzberg@cs.jhu.edu or 
murthy@cs.jhu.edu. 